18.6 连续栈

18.6　连续栈

历经Go 1.3、1.4两个版本的过渡，连续栈（Contiguous Stack）的地位已经稳固。而且Go 1.5和1.4比起来，似乎也没太多的变化，这是个好现象。

连续栈将调用堆栈（call stack）所有栈帧分配在一个连续内存空间。当空间不足时，另分配2x内存块，并拷贝当前栈全部数据，以避免分段栈（Segmented Stack）链表结构在函数调用频繁时可能引发的切分热点（hot split）问题。

结构示意图：

lo stackguard0 hi +---------------+---------------------------------------------------------------+ | StackGuard | | +---------------+---------------------------------------------------------------+ ⇐ SP

runtime2.go

type stack struct { lo uintptr hi uintptr } type g struct { // Stack parameters. // stack describes the actual stack memory: [stack.lo, stack.hi). // stackguard0 is the stack pointer compared in the Go stack growth prologue. // It is stack.lo+StackGuard normally, but can be StackPreempt to trigger a preemption. stack stack stackguard0 uintptr }

其中stackguard0是个非常重要的指针。在函数头部，编译器会插入一段指令，用它和SP寄存器进行比较，从而决定是否需要对栈空间扩容。另外，它还被用作抢占调度的标志。

栈空间的初始分配发生在newproc1创建新G对象时。

stack2.go

// 操作系统需要保留的区域，比如用来处理信号等等 _StackSystem = goos_windows512ptrSize + goos_plan9512 + goos_darwingoarch_arm*1024

// 默认栈大小 _StackMin = 2048

// StackGuard 是一个警戒指针，用来判断栈容量是否需要扩张 _StackGuard = 640*stackGuardMultiplier + _StackSystem

有几个相关常量值，以Linux系统为例，_StackSystem=0，_StackGuard=640。

proc1.go

func newproc1(fn *funcval, argp *uint8, narg int32, nret int32, callerpc uintptr) *g { newg := gfget(p) if newg == nil { newg = malg(_StackMin) } } func malg(stacksize int32) *g { newg := new(g) if stacksize >= 0 { stacksize = round2(_StackSystem + stacksize) systemstack(func() { newg.stack, newg.stkbar = stackalloc(uint32(stacksize)) }) newg.stackguard0 = newg.stack.lo + _StackGuard newg.stackguard1 = ^uintptr(0) newg.stackAlloc = uintptr(stacksize) } return newg }

在获取栈空间后，会立即设置stackguard0指针。

stackcache

因栈空间使用频繁，所以采取了和cache/object类似的做法，就是按大小分成几个等级进行缓存复用，当然也包括回收过多的闲置块。

以Linux为例，_FixedStack大小和_StackMin相同，_NumStackOrders等于4。

stack2.go

// The minimum stack size to allocate. // The hackery here rounds FixedStack0 up to a power of 2. _FixedStack0 = _StackMin + _StackSystem _FixedStack1 = _FixedStack0 - 1 _FixedStack2 = _FixedStack1 | (_FixedStack1 >> 1) _FixedStack3 = _FixedStack2 | (_FixedStack2 >> 2) _FixedStack4 = _FixedStack3 | (_FixedStack3 >> 4) _FixedStack5 = _FixedStack4 | (_FixedStack4 >> 8) _FixedStack6 = _FixedStack5 | (_FixedStack5 >> 16) _FixedStack = _FixedStack6 + 1

malloc.go

// Number of orders that get caching. Order 0 is FixedStack // and each successive order is twice as large. // We want to cache 2KB, 4KB, 8KB, and 16KB stacks. Larger stacks // will be allocated directly. // Since FixedStack is different on different systems, we // must vary NumStackOrders to keep the same maximum cached size. // OS | FixedStack | NumStackOrders // -----------------+------------+--------------- // linux/darwin/bsd | 2KB | 4 // windows/32 | 4KB | 3 // windows/64 | 8KB | 2 // plan9 | 4KB | 3 _NumStackOrders = 4 - ptrSize/4goos_windows - 1goos_plan9

基于同样的性能考虑（无锁分配），栈空间被缓存在Cache.stackcache数组，且使用方法和object基本相同。

mcache.go

type mcache struct { stackcache [_NumStackOrders]stackfreelist } type stackfreelist struct { list gclinkptr // linked list of free stacks size uintptr // total size of stacks in list }

在获取栈空间时，优先检查缓存链表。大空间直接从heap分配。

malloc.go

// Per-P, per order stack segment cache size. _StackCacheSize = 32 * 1024

stack1.go

func stackalloc(n uint32) (stack, []stkbar) { var v unsafe.Pointer // 检查是否从缓存分配 if stackCache != 0 && n < _FixedStack<<_NumStackOrders && n < _StackCacheSize { // 计算 order 等级 order := uint8(0) n2 := n for n2 > _FixedStack { order++ n2 >>= 1 } var x gclinkptr c := thisg.m.mcache

            // 从对应链表提取复用空间          
            x = c.stackcache[order].list
            // 提取失败，扩容后重试
            if x.ptr() == nil {
                    stackcacherefill(c, order)
                    x = c.stackcache[order].list
            }
            // 调整缓存链表
            c.stackcache[order].list = x.ptr().next
            c.stackcache[order].size -= uintptr(n)
            v = (unsafe.Pointer)(x)
    } else {
            // 大空间直接从 heap 分配
            s := mHeap_AllocStack(&mheap_, round(uintptr(n), _PageSize)>>_PageShift)
            v = (unsafe.Pointer)(s.start << _PageShift)
    }
    top := uintptr(n) - nstkbar
    stkbarSlice := slice{add(v, top), 0, maxstkbar}
    return stack{uintptr(v), uintptr(v) + top}, *(*[]stkbar)(unsafe.Pointer(&stkbarSlice))

}

这个函数代码删除较多，主要是为了不影响阅读。对stackpoolalloc在下面也会介绍。

和前文内存分配器的做法很像，不是吗？我们继续看看如何扩容。

stack1.go

func stackcacherefill(c *mcache, order uint8) { var list gclinkptr var size uintptr // 提取一批复用空间 for size < _StackCacheSize/2 { // 每次提取一个 x := stackpoolalloc(order) x.ptr().next = list list = x size += _FixedStack << order }

    // 保存到 cache.stackcache 数组
    c.stackcache[order].list = list
    c.stackcache[order].size = size

}

有个全局缓存stackpool似乎在充当central的角色。

stack1.go

var stackpool [NumStackOrders]mspan func stackpoolalloc(order uint8) gclinkptr { // 尝试从全局缓存获取 list := &stackpool[order] s := list.next // 重新从 heap 获取 span 切分 if s == list { s = mHeap_AllocStack(&mheap, _StackCacheSize>>_PageShift) for i := uintptr(0); i < _StackCacheSize; i += _FixedStack << order { x := gclinkptr(uintptr(s.start)<<_PageShift + i) x.ptr().next = s.freelist s.freelist = x } mSpanList_Insert(list, s) } // 从链表返回一个空间 x := s.freelist s.freelist = x.ptr().next s.ref++ // 如果当前链表已空，则移除 span if s.freelist.ptr() == nil { // all stacks in s are allocated. mSpanList_Remove(s) } return x }

从heap获取span的过程没有任何惊喜。

mheap.go

func mHeap_AllocStack(h *mheap, npage uintptr) *mspan { s := mHeap_AllocSpanLocked(h, npage) if s != nil { s.state = _MSpanStack s.freelist = 0 s.ref = 0 } return s }

简单总结一下：栈内存也从arena区域分配，使用和对象分配相同的策略和算法。只是我有些不明白，这东西是不是可以做到内存分配器里面？还是说为了以后修改方便才独立出来的？

morestack

执行函数前，需要为其准备好所需栈帧空间，此时是检查连续栈是否需要扩容的最佳时机。为此，编译器会在函数头部插入几条特殊指令，通过比较stackguard0和SP来决定是否进行扩容操作。

test.go

package main func test() { println(“hello”) } func main() { test() }

编译（禁用内联），反汇编。

$g o b u i l d - g c f l a g s " - l " - o t es tt es t . g o$ go tool objdump -s “main.test” test TEXT main.test(SB) test.go test.go:3 0x2040 GS MOVQ GS:0x8a0, CX // 当前 G test.go:3 0x2049 CMPQ 0x10(CX), SP // G+0x10 指向 g.stackguard0，和 SP 比较 test.go:3 0x204d JBE 0x2080 // 如果 SP ⇐ stackguard0，则跳转到 0x2080 test.go:3 0x204f SUBQ $0 x 10, SP // 预留当前栈帧空间 t es t . g o : 40 x 2053 C A LL r u n t im e . p r in tl oc k (SB) t es t . g o : 40 x 2058 L E A Q 0 x 6 b 4 f 9 (I P), BXt es t . g o : 40 x 205 f MO V QBX, 0 (SP) t es t . g o : 40 x 2063 MO V Q$ 0x5, 0x8(SP) test.go:4 0x206c CALL runtime.printstring(SB) test.go:4 0x2071 CALL runtime.printnl(SB) test.go:4 0x2076 CALL runtime.printunlock(SB) test.go:5 0x207b ADDQ $0x10, SP test.go:5 0x207f RET test.go:3 0x2080 CALL runtime.morestack_noctxt(SB) // 执行 morestack 扩容 test.go:3 0x2085 JMP main.test(SB) // 扩容结束后，重新执行当前函数

这几条指令很简单。如果SP指针地址小于stackguard0（栈从高位地址向低位地址分配），那么显然已经溢出，这就需要扩容，否则当前和后续函数就无从分配栈帧内存。

细心一点，你会发现CMP指令并没将当前栈帧所需空间算上。假如SP大于stackguard0，但与其差值又小于当前栈帧大小呢？这显然不会跳转执行扩容操作，但又不能满足当前函数需求，难道只能眼看着堆栈溢出？

我们知道在stack.lo和stackguard0之间尚有部分保留空间，所以适当“溢出”是允许的。

stack2.go

// After a stack split check the SP is allowed to be this many bytes below the stack guard.
// This saves an instruction in the checking sequence for tiny frames. _StackSmall = 128

修改一下测试代码，看看效果。

test.go

package main func test1() { var x [128]byte x[1] = 1 } func test2() { var x [129]byte x[1] = 1 } func main() { test1() test2() }

$g o b u i l d - g c f l a g s " - l " - o t es tt es t . g o$ go tool objdump -s “main.test” test TEXT main.test1(SB) test.go test.go:3 0x2040 GS MOVQ GS:0x8a0, CX test.go:3 0x2049 CMPQ 0x10(CX), SP // 当前栈帧 0x80 正好是 128 test.go:3 0x204d JBE 0x206e test.go:3 0x204f SUBQ $0 x 80, SPTEXT main . t es t 2 (SB) t es t . g o t es t . g o : 80 x 2080 GSMO V QGS : 0 x 8 a 0, CXt es t . g o : 80 x 2089 L E A Q - 0 x 8 (SP), A X // 当前栈帧 0 x 88 - 128 = 0 x 8 ，适当调整 SP 后再比较 t es t . g o : 80 x 208 e CMPQ 0 x 10 (CX), A Xt es t . g o : 80 x 2092 J BE 0 x 20 b 5 t es t . g o : 80 x 2094 S U BQ$ 0x88, SP

很显然，如果当前栈帧是SmallStack（0x80），那么就允许在[lo，stackguard0]之间分配。

对栈进行扩容并不是件容易的事情，其中涉及很多内容。不过，在这里我们只须了解其基本过程和算法意图，无须深入到所有细节。

asm_amd64.s

TEXT runtime•morestack_noctxt(SB),NOSPLIT, $0 MO V L$ 0, DX JMP runtime•morestack(SB) TEXT runtime•morestack(SB),NOSPLIT, $0 - 0// C a ll n e w s t a c k o nm - > g 0^{'} ss t a c k . MO V Q m_{g} 0 (BX), BXMO V QBX, g (CX) MO V Q (g_{s} c h e d + g o b u f_{s} p) (BX), SPC A LL r u n t im e • n e w s t a c k (SB) MO V Q$ 0, 0x1003 // crash if newstack returns RET

基本过程就是分配一个2x大小的新栈，然后将数据拷贝过去，替换掉旧栈。当然，这期间需要对指针等内容做些调整。

stack1.go

func newstack() { thisg := getg() gp := thisg.m.curg // 调整执行现场记录 rewindmorestack(&gp.sched) casgstatus(gp, _Grunning, _Gwaiting) gp.waitreason = “stack growth” sp := gp.sched.sp // 扩张 2 倍 oldsize := int(gp.stackAlloc) newsize := oldsize * 2 casgstatus(gp, _Gwaiting, _Gcopystack) // 拷贝栈数据后切换到新栈 copystack(gp, uintptr(newsize)) // 恢复执行 casgstatus(gp, _Gcopystack, _Grunning) gogo(&gp.sched) } func copystack(gp *g, newsize uintptr) { old := gp.stack used := old.hi - gp.sched.sp // 从缓存或堆分配新栈空间 new, newstkbar := stackalloc(uint32(newsize)) // 清零 if stackPoisonCopy != 0 { fillstack(new, 0xfd) } // 调整指针等操作 … // 拷贝数据到新栈空间 memmove(unsafe.Pointer(new.hi-used), unsafe.Pointer(old.hi-used), used) // 切换到新栈 gp.stack = new gp.stackguard0 = new.lo + _StackGuard gp.sched.sp = new.hi - used oldsize := gp.stackAlloc gp.stackAlloc = newsize gp.stkbar = newstkbar // 将旧栈清零后释放 if stackPoisonCopy != 0 { fillstack(old, 0xfc) } stackfree(old, oldsize) }

stackfree

释放栈空间的操作，依旧与回收object类似。

stack1.go

func stackfree(stk stack, n uintptr) { gp := getg() v := (unsafe.Pointer)(stk.lo) // 放回缓存链表 if stackCache != 0 && n < _FixedStack<<_NumStackOrders && n < _StackCacheSize { // 计算 order 等级 order := uint8(0) n2 := n for n2 > _FixedStack { order++ n2 >>= 1 } x := gclinkptr(v) c := gp.m.mcache

            // 如果缓存大小超出限制，则释放一些
            if c.stackcache[order].size >= _StackCacheSize {
                    stackcacherelease(c, order)
            }
            // 放回缓存链表
            x.ptr().next = c.stackcache[order].list
            c.stackcache[order].list = x
            c.stackcache[order].size += n
    } else {
            s := mHeap_Lookup(&mheap_, v)
            if gcphase == _GCoff {
                    // 归还给 heap
                    mHeap_FreeStack(&mheap_, s)
            } else {
                    // 如果正在垃圾回收期间，那么放到一个待处理队列，由垃圾回收器处理
                    mSpanList_Insert(&stackFreeQueue, s)
            }
    }

}

回收的栈空间被放回对应复用链表。如缓存过多，则转移一批到全局链表，或直接将自由的span归还给heap。

func stackcacherelease(c *mcache, order uint8) { x := c.stackcache[order].list size := c.stackcache[order].size // 如果当前链表过大，则释放一半 for size > _StackCacheSize/2 { y := x.ptr().next // 每次释放一个，它们可能属于不同的 span stackpoolfree(x, order) x = y size -= FixedStack << order } c.stackcache[order].list = x c.stackcache[order].size = size } func stackpoolfree(x gclinkptr, order uint8) { // 找到所属 span s := mHeap_Lookup(&mheap, (unsafe.Pointer)(x)) if s.freelist.ptr() == nil { mSpanList_Insert(&stackpool[order], s) } // 添加到 span.freelist x.ptr().next = s.freelist s.freelist = x s.ref— // 如果该 span 已收回全部空间，那么将其归还给 heap if gcphase == GCoff && s.ref == 0 { mSpanList_Remove(s) s.freelist = 0 mHeap_FreeStack(&mheap, s) } }

除了morestack调用导致stackfree操作外，垃圾回收对栈空间的处理也会发生此操作。

mgcmark.go

func markroot(desc *parfor, i uint32) { switch i { case _RootFlushCaches: if gcphase != _GCscan { flushallmcaches() } default: if gcphase == _GCmarktermination { shrinkstack(gp) } } }

mstats.go

func flushallmcaches() { for i := 0; ; i++ { p := allp[i] c := p.mcache mCache_ReleaseAll(c) stackcache_clear(c) } }

mgc.go

func gcMark(start_time int64) { freeStackSpans() }

因垃圾回收需要，stackcache_clear会将所有cache缓存的栈空间归还给全局或heap。

stack1.go

func stackcache_clear(c *mcache) { for order := uint8(0); order < _NumStackOrders; order++ { x := c.stackcache[order].list for x.ptr() != nil { y := x.ptr().next stackpoolfree(x, order) x = y } c.stackcache[order].list = 0 c.stackcache[order].size = 0 } }

而shrinkstack除收回闲置栈空间外，还会收缩正在使用但被扩容过的栈以节约内存。

func shrinkstack(gp *g) { if readgstatus(gp) == _Gdead { if gp.stack.lo != 0 { // 回收闲置 G 的栈空间，重新使用前会为其补上 stackfree(gp.stack, gp.stackAlloc) gp.stack.lo = 0 gp.stack.hi = 0 gp.stkbar = nil gp.stkbarPos = 0 } return } // 收缩目标是一半大小 oldsize := gp.stackAlloc newsize := oldsize / 2 if newsize < _FixedStack { return } // 如果使用的空间超过 1/4，则不收缩 avail := gp.stack.hi - gp.stack.lo if used := gp.stack.hi - gp.sched.sp + _StackLimit; used >= avail/4 { return } // 用较小的栈替换 oldstatus := casgcopystack(gp) copystack(gp, newsize) casgstatus(gp, _Gcopystack, oldstatus) }

最后就是freeStackSpans，它扫描全局队列stackpool和暂存队列stackFreeQueue，将那些空间已完全收回的span交还给heap。

stack1.go

func freeStackSpans() { for order := range stackpool { list := &stackpool[order] for s := list.next; s != list; { next := s.next if s.ref == 0 { mSpanList_Remove(s) s.freelist = 0 mHeap_FreeStack(&mheap_, s) } s = next } } for stackFreeQueue.next != &stackFreeQueue { s := stackFreeQueue.next mSpanList_Remove(s) mHeap_FreeStack(&mheap_, s) } }

另外，调整P数量的procresize，将任务完成的G对象放回复用链表的gfput，同样会引发栈空间释放操作。只是流程和上述基本类似，不再赘述。

运行时三大核心组件之间，相互纠缠得太多太细，已无从划分边界，我个人觉得这并不是什么好主意。诚然，为了性能，很多地方直接植入代码，而非通过消息或接口隔离等方式封装，但随着各部件复杂度和规模的提升，其可维护性必然也会降低。不知道开发团队对此有什么具体的想法。

HL🌱数字花园

探索

18.6 连续栈

18.6　连续栈

stackcache

morestack

stackfree

关系图谱